Loading the data and the necessary libraries:
library(readr)
library(lawstat)
library(tidyverse)
library(highcharter)
library(reshape2)
library(scales)
library(car)
library(ggthemr)
library(Hmisc)
library(corrplot)
library(gridExtra)
train <- read.csv("/Users/deyapple/Documents/Courses/Term04/DA/house/train.csv",
stringsAsFactors = FALSE) %>%
mutate(houseAge = -1 * (YearBuilt - YrSold),
grAge = -1 * (GarageYrBlt - YrSold))
test <- read.csv("/Users/deyapple/Documents/Courses/Term04/DA/house/test.csv",
stringsAsFactors = FALSE) %>%
mutate(houseAge = -1 * (YearBuilt - YrSold),
grAge = -1 * (GarageYrBlt - YrSold))
ggthemr('earth', type = 'inner')
Q1:
Choosing Quantitative Variables
At this early stage, I have only considered quantitative variables. In the later sections I will add a few qualitative predictors to the model as well. Some variables are masked as quantitative when they are actually qualitative in nature. I have removed these variables too, such as the month sold (MoSold).
quan.df <- train %>%
select_if(is.numeric) %>%
select(-YearBuilt, -YrSold, -GarageYrBlt, -MoSold, -MSSubClass, -Id)
Correlation Matrix
Next, I have constructed a correlation matrix from which, using ggplot2, I have drawn a heatmap. (I have removed the code to keep the document tidy.)
A clearer visualization:
Correlation test
Next, I have made a data frame consisting of the correlations and their \(p-values\) under the null hypothesis that the true correlation is zero.
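The flattenCorrMatrix() helper used below was defined in the hidden code; a minimal version (an assumption on my part, modeled on the widely used STHDA helper of the same name) would be:

```r
# Hypothetical reconstruction of the flattenCorrMatrix helper:
# turn a square correlation matrix and its p-value matrix into a long
# data frame with one row per variable pair.
flattenCorrMatrix <- function(cormat, pmat) {
  ut <- upper.tri(cormat)  # each pair appears exactly once
  data.frame(
    row    = rownames(cormat)[row(cormat)[ut]],
    column = rownames(cormat)[col(cormat)[ut]],
    cor    = cormat[ut],
    p      = pmat[ut]
  )
}
```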
# correlations and p-values, presumably from Hmisc::rcorr in the hidden code
quan.cor_and_p <- rcorr(as.matrix(quan.df))
correlations <- quan.cor_and_p$r
pvalues <- quan.cor_and_p$P
cor_and_p <- flattenCorrMatrix(correlations, pvalues)
cor_and_p %>%
filter(cor < 1) %>%
filter(column == "SalePrice") %>%
arrange(desc(cor)) %>%
head(50)
            row    column         cor            p
1 OverallQual SalePrice 0.79098159 0.000000e+00
2 GrLivArea SalePrice 0.70862448 0.000000e+00
3 GarageCars SalePrice 0.64040917 0.000000e+00
4 GarageArea SalePrice 0.62343144 0.000000e+00
5 TotalBsmtSF SalePrice 0.61358052 0.000000e+00
6 X1stFlrSF SalePrice 0.60585219 0.000000e+00
7 FullBath SalePrice 0.56066376 0.000000e+00
8 TotRmsAbvGrd SalePrice 0.53372318 0.000000e+00
9 YearRemodAdd SalePrice 0.50710094 0.000000e+00
10 MasVnrArea SalePrice 0.47749305 0.000000e+00
11 Fireplaces SalePrice 0.46692884 0.000000e+00
12 BsmtFinSF1 SalePrice 0.38641980 0.000000e+00
13 LotFrontage SalePrice 0.35179910 0.000000e+00
14 WoodDeckSF SalePrice 0.32441345 0.000000e+00
15 X2ndFlrSF SalePrice 0.31933379 0.000000e+00
16 OpenPorchSF SalePrice 0.31585622 0.000000e+00
17 HalfBath SalePrice 0.28410769 0.000000e+00
18 LotArea SalePrice 0.26384336 0.000000e+00
19 BsmtFullBath SalePrice 0.22712223 0.000000e+00
20 BsmtUnfSF SalePrice 0.21447910 0.000000e+00
21 BedroomAbvGr SalePrice 0.16821316 9.927481e-11
22 ScreenPorch SalePrice 0.11144657 1.972139e-05
23 PoolArea SalePrice 0.09240355 4.073492e-04
24 X3SsnPorch SalePrice 0.04458367 8.858169e-02
25 BsmtFinSF2 SalePrice -0.01137812 6.639986e-01
26 BsmtHalfBath SalePrice -0.01684415 5.201537e-01
27 MiscVal SalePrice -0.02118958 4.184863e-01
28 LowQualFinSF SalePrice -0.02560613 3.282073e-01
29 OverallCond SalePrice -0.07785589 2.912352e-03
30 EnclosedPorch SalePrice -0.12857796 8.255763e-07
31 KitchenAbvGr SalePrice -0.13590737 1.860428e-07
Top 10 Highest Correlations
Finally, I have selected the variables with the highest correlations. For now, these will be our predictors.
top10 <- quan.cor.melt %>%
filter(Var1 == "SalePrice") %>%
filter(Var2 != "SalePrice") %>%
arrange(desc(value)) %>%
head(10)
In the data frame from the last section, if we look up the \(p-values\) of these variables, we will see that they are very low, which results in the rejection of the null hypothesis that there is no correlation between SalePrice and the variable.
Q2 :
Drawing The Scatter Plot
Selecting the top 10 predictors from the data frame:
quan.predictors <- top10 %>%
select(Var2) %>%
unname() %>%
unlist() %>%
as.character()
var.nm <- c(quan.predictors, c("SalePrice"))
quan.df.sel <- quan.df %>%
select(one_of(var.nm))
Drawing the two-by-two scatter plot :
scatterplotMatrix(quan.df.sel)
Determining Collinearity
Assembling the correlation matrix :
(Again, I haven't included the code, in order to keep the document tidy.)
It makes sense for these variables to be collinear. Even without any statistical knowledge, it is intuitively obvious that an increase in GarageCars would come with an increase in GarageArea. Later on, if any of these predictors are still in the model, I will use the vif() function to determine whether they are collinear or not.
Q3:
quan.df.reg <- lm(SalePrice ~ GarageArea + YearRemodAdd +
GarageCars + TotRmsAbvGrd + FullBath +
GrLivArea + X1stFlrSF +
TotalBsmtSF + OverallQual + MasVnrArea, data = quan.df,
na.action = na.exclude)
quan.df.reg.sum <- summary(quan.df.reg)
quan.df.reg.sum
Call:
lm(formula = SalePrice ~ GarageArea + YearRemodAdd + GarageCars +
TotRmsAbvGrd + FullBath + GrLivArea + X1stFlrSF + TotalBsmtSF +
OverallQual + MasVnrArea, data = quan.df, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-481370 -18855 -1092 16075 295028
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.311e+05 1.165e+05 -7.995 2.63e-15 ***
GarageArea 1.235e+01 1.029e+01 1.200 0.23024
YearRemodAdd 4.330e+02 6.005e+01 7.210 9.00e-13 ***
GarageCars 1.237e+04 2.998e+03 4.126 3.90e-05 ***
TotRmsAbvGrd -2.601e+02 1.113e+03 -0.234 0.81525
FullBath -1.813e+03 2.552e+03 -0.710 0.47755
GrLivArea 4.387e+01 4.156e+00 10.557 < 2e-16 ***
X1stFlrSF 1.287e+01 4.915e+00 2.617 0.00896 **
TotalBsmtSF 2.134e+01 4.238e+00 5.036 5.36e-07 ***
OverallQual 1.992e+04 1.167e+03 17.069 < 2e-16 ***
MasVnrArea 3.774e+01 6.272e+00 6.018 2.23e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 37760 on 1441 degrees of freedom
(8 observations deleted due to missingness)
Multiple R-squared: 0.7747, Adjusted R-squared: 0.7732
F-statistic: 495.6 on 10 and 1441 DF, p-value: < 2.2e-16
Q4:
Performing the Prediction:
quan.df.pred <- quan.df %>%
mutate(SalePricePred = fitted(quan.df.reg, type = "response"),
SalePriceError = residuals.lm(quan.df.reg))
Predicted Price Vs. Actual Price:
one.pa <- ggplot(quan.df.pred, aes(x = SalePrice, y = SalePricePred)) +
geom_point() +
geom_smooth()
one.pa
quan.df.predh <- quan.df.pred %>%
filter(!is.na(SalePrice)) %>%
filter(!is.na(SalePricePred))
hchart(type = "area", density(quan.df.predh$SalePrice), name = "Actual") %>%
hc_add_series(type = "area", data = density(quan.df.predh$SalePricePred), name = "Prediction") %>%
hc_add_theme(hc_theme_db())
As we can see, the predicted price and the actual price are very close in most instances. That is, the scatter plot resembles the line \(y = x\) (the bisector of the first and third quadrants).
Error Density Plot:
one.ed <- ggplot(quan.df.pred) +
geom_histogram(aes(y = ..density.., x = SalePriceError), alpha = 0.5, bins = 150) +
geom_density(aes(x = SalePriceError), size = 0.5)
one.ed
As we can see, the error histogram is a slightly right-skewed bell shape, indicating an approximately normal distribution for the error.
Q5:
The RSE (Residual Standard Error) of the model is computed as \(RSE = \sqrt{\frac{RSS}{n - p - 1}}\), where RSS is the Residual Sum of Squares of the model and \(p\) the number of predictors (for simple regression this reduces to \(\sqrt{RSS/(n-2)}\)). Since RSE is measured on the scale of Y (the response variable), it does not give a clear view of whether the model is a good fit or not. In order to test whether the model is a good fit for the data, we use the \(R^2\) value, which is computed as follows: \(R^2= \frac{TSS - RSS}{TSS}\), where \(TSS = \Sigma{(y_i - \bar{y})^2}\) is the Total Sum of Squares. TSS measures the total variance in Y (the response variable) before performing the regression, while RSS measures the amount of variability that is left unexplained by the model after the regression is performed. Therefore \(R^2\) is a measure between 0 and 1 indicating how much of the variability in the response is explained by our regression model. It differs from RSE since it is always between 0 and 1 and therefore independent of the scale of Y.
quan.df.reg.sum$r.squared
[1] 0.774743
As we can see, the \(R^2\) value is relatively high, therefore most of the variance in the response is explained by our regression model.
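As a sanity check on the formula above, \(R^2\) can be recomputed by hand from TSS and RSS; a small sketch on the built-in mtcars data (not the housing data) follows:

```r
# R^2 = (TSS - RSS) / TSS, recomputed by hand (mtcars for illustration)
fit <- lm(mpg ~ wt + hp, data = mtcars)
rss <- sum(residuals(fit)^2)                    # unexplained variability
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total variability in Y
r2  <- (tss - rss) / tss
all.equal(r2, summary(fit)$r.squared)           # TRUE
```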
The \(F-statistic\) is our answer to the following question: is there a relationship between the response and at least one of our predictors? In other words, we want to test the null hypothesis \[H_0 : \beta_1 = \beta_2 = ... = \beta_p = 0\] versus the alternative hypothesis \[H_a: {at~least~one~\beta_j~is~non-zero}\] To perform this hypothesis test we compute the \(F-statistic\), \[F = \frac{\frac{TSS - RSS}{p}}{\frac{RSS}{n - p - 1}}\] where \(p\) is the number of predictors and \(n\) is the sample size. Based on an F-distribution table, we can find the minimum value of F required to reject the null hypothesis under different values of \(\alpha\).
quan.df.reg.sum$fstatistic
   value    numdf    dendf 
495.6137  10.0000 1441.0000 
As we can see from the relatively large value of the \(F-statistic\) and the corresponding low \(p-value\), we can reject the null hypothesis that there is no relationship between the response and any of the predictors.
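The same hand computation works for the \(F-statistic\); again a sketch on mtcars rather than the housing data:

```r
# F = ((TSS - RSS)/p) / (RSS/(n - p - 1)), recomputed by hand (mtcars)
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars); p <- 2                       # sample size, number of predictors
rss <- sum(residuals(fit)^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
f <- ((tss - rss) / p) / (rss / (n - p - 1))
all.equal(unname(f), unname(summary(fit)$fstatistic["value"]))  # TRUE
```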
Q6:
quan.df.reg.sum
Call:
lm(formula = SalePrice ~ GarageArea + YearRemodAdd + GarageCars +
TotRmsAbvGrd + FullBath + GrLivArea + X1stFlrSF + TotalBsmtSF +
OverallQual + MasVnrArea, data = quan.df, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-481370 -18855 -1092 16075 295028
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.311e+05 1.165e+05 -7.995 2.63e-15 ***
GarageArea 1.235e+01 1.029e+01 1.200 0.23024
YearRemodAdd 4.330e+02 6.005e+01 7.210 9.00e-13 ***
GarageCars 1.237e+04 2.998e+03 4.126 3.90e-05 ***
TotRmsAbvGrd -2.601e+02 1.113e+03 -0.234 0.81525
FullBath -1.813e+03 2.552e+03 -0.710 0.47755
GrLivArea 4.387e+01 4.156e+00 10.557 < 2e-16 ***
X1stFlrSF 1.287e+01 4.915e+00 2.617 0.00896 **
TotalBsmtSF 2.134e+01 4.238e+00 5.036 5.36e-07 ***
OverallQual 1.992e+04 1.167e+03 17.069 < 2e-16 ***
MasVnrArea 3.774e+01 6.272e+00 6.018 2.23e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 37760 on 1441 degrees of freedom
(8 observations deleted due to missingness)
Multiple R-squared: 0.7747, Adjusted R-squared: 0.7732
F-statistic: 495.6 on 10 and 1441 DF, p-value: < 2.2e-16
In order to select the variables that truly have an effect on the response, I have used the backward selection method. The method proceeds as follows: we start with all the predictors in our model and remove the predictor with the highest \(p-value\), i.e. the least statistically significant predictor in the model. We then fit the new model using the remaining \(p - 1\) predictors and repeat. We keep removing predictors with high \(p-values\) until every remaining predictor has a \(p-value\) low enough to be statistically significant.
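The procedure described above can be sketched as a small loop (illustrated on mtcars; the helper name backward_by_p is mine, and it treats each term individually, so factors and poly() terms would need extra care):

```r
# Backward selection by p-value: repeatedly drop the predictor with the
# largest p-value until all remaining p-values fall below alpha.
backward_by_p <- function(response, predictors, data, alpha = 0.05) {
  repeat {
    fit   <- lm(reformulate(predictors, response), data = data)
    coefs <- summary(fit)$coefficients[-1, , drop = FALSE]  # skip intercept
    worst <- which.max(coefs[, "Pr(>|t|)"])
    if (coefs[worst, "Pr(>|t|)"] < alpha) return(fit)
    predictors <- setdiff(predictors, rownames(coefs)[worst])
  }
}
final <- backward_by_p("mpg", c("wt", "hp", "disp", "drat", "qsec"), mtcars)
```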
step 1: Removing TotRmsAbvGrd
re-fitting the model :
quan.df.reg.mod1 <- update(quan.df.reg, . ~ . - TotRmsAbvGrd, na.action = na.exclude)
quan.df.reg.mod1.sum <- summary(quan.df.reg.mod1)
quan.df.pred <- quan.df.pred %>%
mutate(SalePricePredMod1 = fitted(quan.df.reg.mod1, type = "response"),
SalePriceErrorMod1 = residuals.lm(quan.df.reg.mod1))
drawing plots:
two.pa <- ggplot(quan.df.pred, aes(x = SalePrice, y = SalePricePredMod1)) +
geom_point() +
geom_smooth()
two.pa
quan.df.predh <- quan.df.pred %>%
filter(!is.na(SalePrice)) %>%
filter(!is.na(SalePricePredMod1))
hchart(type = "area", density(quan.df.predh$SalePrice), name = "Actual") %>%
hc_add_series(type = "area", data = density(quan.df.predh$SalePricePredMod1), name = "Prediction 1") %>%
hc_add_theme(hc_theme_db())
two.ed <- ggplot(quan.df.pred) +
geom_histogram(aes(y = ..density.., x = SalePriceErrorMod1), alpha = 0.5, bins = 150) +
geom_density(aes(x = SalePriceErrorMod1), size = 0.5)
two.ed
calculating \(R^2\) and the \(F-statistic\):
quan.df.reg.mod1.sum$r.squared
[1] 0.7747344
quan.df.reg.mod1.sum$fstatistic
  value   numdf   dendf 
551.037   9.000 1442.000 
seeing whether or not there are predictors to delete in the next round:
quan.df.reg.mod1.sum
Call:
lm(formula = SalePrice ~ GarageArea + YearRemodAdd + GarageCars +
FullBath + GrLivArea + X1stFlrSF + TotalBsmtSF + OverallQual +
MasVnrArea, data = quan.df, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-480566 -18728 -1121 16064 296091
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.333e+05 1.160e+05 -8.044 1.80e-15 ***
GarageArea 1.252e+01 1.026e+01 1.220 0.22276
YearRemodAdd 4.337e+02 5.995e+01 7.235 7.52e-13 ***
GarageCars 1.232e+04 2.990e+03 4.122 3.98e-05 ***
FullBath -1.881e+03 2.535e+03 -0.742 0.45826
GrLivArea 4.318e+01 2.923e+00 14.773 < 2e-16 ***
X1stFlrSF 1.285e+01 4.914e+00 2.616 0.00899 **
TotalBsmtSF 2.143e+01 4.222e+00 5.074 4.40e-07 ***
OverallQual 1.994e+04 1.163e+03 17.138 < 2e-16 ***
MasVnrArea 3.779e+01 6.267e+00 6.030 2.08e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 37750 on 1442 degrees of freedom
(8 observations deleted due to missingness)
Multiple R-squared: 0.7747, Adjusted R-squared: 0.7733
F-statistic: 551 on 9 and 1442 DF, p-value: < 2.2e-16
step 2: Removing FullBath
re-fitting the model:
quan.df.reg.mod2 <- update(quan.df.reg.mod1, . ~ . - FullBath, na.action = na.exclude)
quan.df.reg.mod2.sum <- summary(quan.df.reg.mod2)
quan.df.pred <- quan.df.pred %>%
mutate(SalePricePredMod2 = fitted(quan.df.reg.mod2, type = "response"),
SalePriceErrorMod2 = residuals.lm(quan.df.reg.mod2))
drawing plots :
three.pa <- ggplot(quan.df.pred, aes(x = SalePrice, y = SalePricePredMod2)) +
geom_point() +
geom_smooth()
three.pa
quan.df.predh <- quan.df.pred %>%
filter(!is.na(SalePrice)) %>%
filter(!is.na(SalePricePredMod2))
hchart(type = "area", density(quan.df.predh$SalePrice), name = "Actual") %>%
hc_add_series(type = "area", data = density(quan.df.predh$SalePricePredMod2), name = "Prediction 2") %>%
hc_add_theme(hc_theme_db())
three.ed <- ggplot(quan.df.pred) +
geom_histogram(aes(y = ..density.., x = SalePriceErrorMod2), alpha = 0.5, bins = 150) +
geom_density(aes(x = SalePriceErrorMod2), size = 0.5)
three.ed
calculating \(R^2\) and the \(F-statistic\):
quan.df.reg.mod2.sum$r.squared
[1] 0.7746484
quan.df.reg.mod2.sum$fstatistic
   value    numdf    dendf 
620.0411   8.0000 1443.0000 
seeing whether or not there are predictors to delete in the next round:
quan.df.reg.mod2.sum
Call:
lm(formula = SalePrice ~ GarageArea + YearRemodAdd + GarageCars +
GrLivArea + X1stFlrSF + TotalBsmtSF + OverallQual + MasVnrArea,
data = quan.df, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-478209 -18973 -1206 15861 296677
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.138e+05 1.130e+05 -8.087 1.28e-15 ***
GarageArea 1.327e+01 1.021e+01 1.299 0.19406
YearRemodAdd 4.235e+02 5.834e+01 7.260 6.31e-13 ***
GarageCars 1.198e+04 2.954e+03 4.056 5.25e-05 ***
GrLivArea 4.225e+01 2.642e+00 15.995 < 2e-16 ***
X1stFlrSF 1.276e+01 4.911e+00 2.598 0.00948 **
TotalBsmtSF 2.161e+01 4.214e+00 5.127 3.34e-07 ***
OverallQual 1.984e+04 1.156e+03 17.167 < 2e-16 ***
MasVnrArea 3.786e+01 6.265e+00 6.044 1.91e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 37740 on 1443 degrees of freedom
(8 observations deleted due to missingness)
Multiple R-squared: 0.7746, Adjusted R-squared: 0.7734
F-statistic: 620 on 8 and 1443 DF, p-value: < 2.2e-16
step 3: Removing GarageArea
re-fitting the model:
quan.df.reg.mod3 <- update(quan.df.reg.mod2, . ~ . - GarageArea,
na.action = na.exclude)
quan.df.reg.mod3.sum <- summary(quan.df.reg.mod3)
quan.df.pred <- quan.df.pred %>%
mutate(SalePricePredMod3 = fitted(quan.df.reg.mod3, type = "response"),
SalePriceErrorMod3 = residuals.lm(quan.df.reg.mod3))
drawing plots :
four.pa <- ggplot(quan.df.pred, aes(x = SalePrice, y = SalePricePredMod3)) +
geom_point() +
geom_smooth()
four.pa
quan.df.predh <- quan.df.pred %>%
filter(!is.na(SalePrice)) %>%
filter(!is.na(SalePricePredMod3))
hchart(type = "area", density(quan.df.predh$SalePrice), name = "Actual") %>%
hc_add_series(type = "area", data = density(quan.df.predh$SalePricePredMod3), name = "Prediction 3") %>%
hc_add_theme(hc_theme_db())
four.ed <- ggplot(quan.df.pred) +
geom_histogram(aes(y = ..density.., x = SalePriceErrorMod3), alpha = 0.5, bins = 150) +
geom_density(aes(x = SalePriceErrorMod3), size = 0.5)
four.ed
calculating \(R^2\) and the \(F-statistic\):
quan.df.reg.mod3.sum$r.squared
[1] 0.7743848
quan.df.reg.mod3.sum$fstatistic
   value    numdf    dendf 
708.0398   7.0000 1444.0000 
seeing whether or not there are predictors to delete in the next round:
quan.df.reg.mod3.sum
Call:
lm(formula = SalePrice ~ YearRemodAdd + GarageCars + GrLivArea +
X1stFlrSF + TotalBsmtSF + OverallQual + MasVnrArea, data = quan.df,
na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-471039 -18730 -1136 16449 296004
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.111e+05 1.130e+05 -8.062 1.56e-15 ***
YearRemodAdd 4.220e+02 5.834e+01 7.234 7.60e-13 ***
GarageCars 1.509e+04 1.736e+03 8.691 < 2e-16 ***
GrLivArea 4.238e+01 2.641e+00 16.048 < 2e-16 ***
X1stFlrSF 1.317e+01 4.902e+00 2.687 0.00729 **
TotalBsmtSF 2.211e+01 4.198e+00 5.268 1.59e-07 ***
OverallQual 1.980e+04 1.156e+03 17.136 < 2e-16 ***
MasVnrArea 3.830e+01 6.257e+00 6.121 1.20e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 37750 on 1444 degrees of freedom
(8 observations deleted due to missingness)
Multiple R-squared: 0.7744, Adjusted R-squared: 0.7733
F-statistic: 708 on 7 and 1444 DF, p-value: < 2.2e-16
Since all of the \(p-values\) are less than 0.05, we can reject the null hypothesis \(H_0: \beta_j = 0\) for each remaining predictor with a significance level of \(\alpha = 0.05\), so we stop here and remove no further predictors at this stage.
Quick Visual Comparison
Predicted Vs. Actual :
grid.arrange(one.pa, two.pa, three.pa, four.pa, nrow = 2)
Error Density Plots:
grid.arrange(one.ed, two.ed, three.ed, four.ed, nrow = 2)
In order to see whether the remaining predictors are collinear (as I suggested in Question 2), I have used the vif() function. The closer the output is to 1, the less a predictor is correlated with the others.
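For reference, the value vif() reports for a predictor is \(1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing that predictor on all the others; a sketch on mtcars:

```r
# VIF for one predictor = 1 / (1 - R^2_j), where R^2_j is the R-squared
# from regressing that predictor on the remaining predictors.
fit_wt <- lm(wt ~ hp + disp, data = mtcars)   # wt on the other predictors
vif_wt <- 1 / (1 - summary(fit_wt)$r.squared)
# this matches car::vif(lm(mpg ~ wt + hp + disp, data = mtcars))["wt"]
```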
vif(quan.df.reg.mod3)
YearRemodAdd   GarageCars    GrLivArea    X1stFlrSF  TotalBsmtSF 
1.478150 1.718849 1.961472 3.626596 3.443565
OverallQual MasVnrArea
2.594410 1.306956
Since none of the VIF values are alarmingly high (a common rule of thumb only flags values above 5 or 10), I won't change the model :
quan.df.fit <- quan.df.reg.mod3
Q7
Constant Variance
The Residuals vs. Fitted plot at the top left and the Scale-Location plot (standardized residuals vs. fitted) at the bottom left are good checks of this assumption. The plots show that the residual variance is not quite constant. If the residuals had constant variance, there would be no visible pattern in the red line and the line would be roughly horizontal. As you can see, this is clearly not the case.
par(mfrow=c(2,2))
plot(quan.df.fit)
Normality
This plot (similar to the top-right chart in the previous section) shows whether our residuals follow a normal distribution or not. The closer the points are to the dashed line, the closer the residuals are to normal. As we can see, the residuals drift away from the dashed line towards the tails. To fix this we can transform the response with sqrt() or log().
car::qqPlot(quan.df.fit, id.method="identify",
simulate = TRUE, main="Q-Q Plot")
using sqrt():
quan.df.fit.sqrt <- update(quan.df.fit, sqrt(SalePrice) ~ .,
na.action = na.exclude)
car::qqPlot(quan.df.fit.sqrt, id.method="identify",
simulate = TRUE, main="Q-Q Plot")
using log():
quan.df.fit.log <- update(quan.df.fit, log(SalePrice) ~ .,
na.action = na.exclude)
car::qqPlot(quan.df.fit.log, id.method="identify",
simulate = TRUE, main="Q-Q Plot")
The log() transform works slightly better for normality than sqrt().
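One way to back up the visual comparison is a Shapiro-Wilk test on each set of residuals; a sketch on simulated log-normal data (standing in for SalePrice, since the point here is only the mechanics):

```r
# Compare residual normality under sqrt() vs log() transforms on
# simulated data with multiplicative (log-normal) noise.
set.seed(1)
x <- runif(500, 1, 10)
y <- exp(0.3 * x + rnorm(500, sd = 0.4))      # stands in for SalePrice
p_sqrt <- shapiro.test(residuals(lm(sqrt(y) ~ x)))$p.value
p_log  <- shapiro.test(residuals(lm(log(y) ~ x)))$p.value
# with multiplicative noise, the log residuals are the ones that come
# out approximately normal, so p_log is the larger p-value
```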
Independence
In other words, we would like to check whether the residuals are auto-correlated. When the residuals are auto-correlated, the current value depends on the previous value. There are a few ways to determine whether the residuals are dependent (auto-correlated) or not. 1. Using the ACF plot
acf(quan.df.fit.log$residuals)
The first bar shows the auto-correlation of the residuals with themselves (lag 0), therefore it is always equal to one. As we can see, the following bars drop below the blue dashed line (the significance bounds), from which we can conclude that the residuals are not auto-correlated (they are independent).
2. Using runs.test
In the runs test, the null hypothesis is that the residuals are random (a value is not affected by its previous value).
lawstat::runs.test(quan.df.fit.log$residuals)
Runs Test - Two sided
data: quan.df.fit.log$residuals
Standardized Runs Statistic = -0.21002, p-value = 0.8337
Since the \(p-value\) attained from this test is very high, we fail to reject the null hypothesis.
3. Using the Durbin-Watson test.
lmtest::dwtest(quan.df.fit.log)
Durbin-Watson test
data: quan.df.fit.log
DW = 1.9978, p-value = 0.4831
alternative hypothesis: true autocorrelation is greater than 0
All three methods support the assumption that the residuals are independent.
Q8:
Dividing the train dataset into training and testing parts:
samp_size = floor(0.8 * nrow(train))
train_ind = sample(seq_len(nrow(train)), size = samp_size)
train.tmp <- train[train_ind, ]
test.tmp <- train[-train_ind, ]
fitting the model :
quan.df.reg.tmp <- lm(log(SalePrice) ~ YearRemodAdd + GarageCars
+ GrLivArea + X1stFlrSF + TotalBsmtSF +
OverallQual + MasVnrArea, data = train.tmp)
Drawing the plot :
test.tmp <- test.tmp %>%
mutate(SalePricePred = exp(predict(quan.df.reg.tmp,
type = "response",
newdata = test.tmp)),
SalePriceError = (SalePrice - SalePricePred) ^ 2)
ggplot(test.tmp) +
geom_point(aes(x = SalePrice, y = SalePricePred))
ggplot(test.tmp) +
geom_histogram(aes(y = ..density.., x = SalePriceError), alpha = 0.5, bins = 150) +
geom_density(aes(x = SalePriceError), size = 0.5)
Determining the error of our prediction:
In order to achieve this, I have calculated the Root Mean Squared Error, \[RMSE = \sqrt{\frac{1}{n}\Sigma(\hat{Y_i} - Y_i)^2} \]
RMSE <- sqrt(mean(test.tmp$SalePriceError, na.rm = T))
RMSE
[1] 108427.1
Q9:
Based on the plots drawn in Question 2, I have drawn scatter plots of SalePrice against the predictors that seemed to have a non-linear relationship, which are:
OverallQual
Based on the chart below, it seems like SalePrice is a degree-3 polynomial function of OverallQual.
ggplot(quan.df, aes(x = OverallQual, y = SalePrice)) +
geom_count() +
geom_smooth()
GarageCars
SalePrice seems to be a degree-2 polynomial function of GarageCars.
ggplot(quan.df, aes(x = GarageCars, y = SalePrice)) +
geom_count()
X1stFlrSF
X1stFlrSF seems to have a linear relationship with SalePrice up to about 3000; the points beyond that can be considered high-leverage observations. Leverage points can be dangerous in model fitting, therefore we can delete them from the training set.
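Leverage can also be checked numerically with hatvalues(); the cutoff \(2(p+1)/n\) below is a common rule of thumb, not something from the original analysis (mtcars used for illustration):

```r
# Flag high-leverage rows via hat values; 2*(p+1)/n is a rule of thumb.
fit <- lm(mpg ~ wt + hp, data = mtcars)
h <- hatvalues(fit)
p <- 2                                    # number of predictors
cutoff <- 2 * (p + 1) / nrow(mtcars)
high <- names(which(h > cutoff))          # candidate leverage points
```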
ggplot(quan.df, aes(x = X1stFlrSF, y = SalePrice)) +
geom_point() +
geom_smooth()
after removing the leverage points :
train <- train %>%
filter(X1stFlrSF < 3000)
ggplot(train, aes(x = X1stFlrSF, y = SalePrice)) +
geom_point() +
geom_smooth()
Q10:
The Final Model
Based on the calculations above, I have made a few adjustments to the model. Also, I have added a few categorical predictors that seemed to have a strong effect on the response. Lastly, after observing the data set and the description of each column, I have added a few extra predictors that intuitively seemed to affect the price, such as:
* OverallCond
* houseAge
* YearBuilt
* MSZoning
* MSSubClass
* CentralAir
quan.df.final <- lm(log(SalePrice) ~
poly(OverallQual, 3) + poly(OverallCond, 3) +
X1stFlrSF +
poly(GarageCars, 2) +
TotalBsmtSF + GrLivArea + houseAge +
YearRemodAdd + YearBuilt + MSZoning + MSSubClass +
CentralAir,
data = train,
na.action = na.exclude)
summary(quan.df.final)
Call:
lm(formula = log(SalePrice) ~ poly(OverallQual, 3) + poly(OverallCond,
3) + X1stFlrSF + poly(GarageCars, 2) + TotalBsmtSF + GrLivArea +
houseAge + YearRemodAdd + YearBuilt + MSZoning + MSSubClass +
CentralAir, data = train, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-0.81005 -0.07033 -0.00134 0.07714 0.55951
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.374e+01 5.230e+00 2.627 0.00870 **
poly(OverallQual, 3)1 4.104e+00 2.242e-01 18.304 < 2e-16 ***
poly(OverallQual, 3)2 8.389e-01 1.535e-01 5.465 5.44e-08 ***
poly(OverallQual, 3)3 1.940e-01 1.445e-01 1.342 0.17966
poly(OverallCond, 3)1 2.192e+00 1.705e-01 12.854 < 2e-16 ***
poly(OverallCond, 3)2 -3.870e-01 1.480e-01 -2.614 0.00903 **
poly(OverallCond, 3)3 -2.027e-01 1.427e-01 -1.421 0.15555
X1stFlrSF 1.583e-06 1.768e-05 0.090 0.92868
poly(GarageCars, 2)1 1.964e+00 1.839e-01 10.678 < 2e-16 ***
poly(GarageCars, 2)2 -3.070e-01 1.472e-01 -2.086 0.03718 *
TotalBsmtSF 1.761e-04 1.549e-05 11.364 < 2e-16 ***
GrLivArea 2.912e-04 1.001e-05 29.104 < 2e-16 ***
houseAge -4.740e-03 2.606e-03 -1.819 0.06916 .
YearRemodAdd 1.161e-03 2.460e-04 4.719 2.60e-06 ***
YearBuilt -2.475e-03 2.615e-03 -0.946 0.34420
MSZoningFV 3.939e-01 4.691e-02 8.395 < 2e-16 ***
MSZoningRH 3.340e-01 5.390e-02 6.197 7.48e-10 ***
MSZoningRL 3.729e-01 4.346e-02 8.580 < 2e-16 ***
MSZoningRM 2.669e-01 4.368e-02 6.111 1.27e-09 ***
MSSubClass -2.197e-04 9.452e-05 -2.325 0.02023 *
CentralAirY 6.609e-02 1.670e-02 3.958 7.94e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1313 on 1436 degrees of freedom
Multiple R-squared: 0.8932, Adjusted R-squared: 0.8918
F-statistic: 600.7 on 20 and 1436 DF, p-value: < 2.2e-16
predicting the Sale Prices for the testing data:
test <- test %>%
mutate(SalePricePred = exp(predict(quan.df.final,
type = "response",
newdata = test)))
Kaggle Submission File :
result <- test %>%
select(Id, SalePrice = SalePricePred)
write.csv(result, file = "/Users/deyapple/Documents/result.csv", row.names = F)
Kaggle Ranking (#2433):
Since some of the predicted values were NA, I replaced them with the mean prediction by hand before submitting to Kaggle.
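The by-hand NA replacement could also be scripted like this (toy values; the actual predictions are not reproduced here):

```r
# Replace NA predictions with the mean of the non-missing predictions
# before writing the submission file (toy values shown).
preds <- c(210000, NA, 185000, NA, 302500)
preds[is.na(preds)] <- mean(preds, na.rm = TRUE)
preds   # the two NAs are now 232500
```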